Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-13566][CORE] Avoid deadlock between BlockManager and Executor Thread #11546

Closed
wants to merge 5 commits into from

Conversation

cenyuhai
Copy link
Contributor

@cenyuhai cenyuhai commented Mar 6, 2016

Temp patch for branch 1.6, avoid deadlock between BlockManager and Executor Thread.

@JoshRosen
Copy link
Contributor

Whoops, my bad: I misunderstood this patch because it's missing a description. In the future, please submit a more complete description with your PR.


if (releasedLocks.nonEmpty) {
val errMsg =
s"${releasedLocks.size} block locks were not released by TID = $taskId:\n" +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...which would tend to argue that this should be named unreleasedLocks, not releasedLocks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@markhamstra These codes are from #10705 by JoshRosen

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JoshRosen In my production environment, when the storage memory is full, there is a great probability of deadlock. It is a temporary patch because JoshRosen add a read/write lock for block in #10705 for Spark 2.0.

Two theads are removing the same block which result in deadlock.
BlockManager will first lock MemoryManager and wait to lock BlockInfo in function dropFromMemory, Execturo task lock BlockInfo and wait to lock MemoryManager calling memstore.remove(block) in function removeBlock or function removeOldBlocks.

So just use a ConcurrentHashMap to record the locks by tasks. In case of failure, release all lock after task complete.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@markhamstra this is a little confusing but it's basically saying that all the locks are supposed to be released by now, so if we have to release any locks here (i.e. if releasedLocks.nonEmpty) then something it wrong.

@cenyuhai cenyuhai changed the title [SPARK-13566] Avoid deadlock between BlockManager and Executor Thread [SPARK-13566][CORE] Avoid deadlock between BlockManager and Executor Thread Mar 22, 2016
zzcclp added a commit to zzcclp/spark that referenced this pull request Mar 30, 2016
@jamesecahill
Copy link

Any word on if this, or any other fix for SPARK-13566 will make it into 1.6.2?

@cenyuhai
Copy link
Contributor Author

@jamesecahill I don't know whether @JoshRosen will provide any other patch for this issue. But I have fixed this bug in my production environment by this PR.

@andrewor14
Copy link
Contributor

ok to test

@SparkQA
Copy link

SparkQA commented May 5, 2016

Test build #57824 has finished for PR 11546 at commit aaa6a96.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -1051,11 +1058,13 @@ private[spark] class BlockManager(
}
blockIsUpdated = true
}
pendingToRemove.put(blockId, currentTaskAttemptId)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This put should happen as soon as we check containsKey in L1035. Between now and then many things could happen; another thread may find that containsKey is false at the same time and both threads can still end up trying to drop the block.

@andrewor14
Copy link
Contributor

@cenyuhai Thanks for fixing this issue. The two offending locks here seem to be info and memoryManager, which wrap each other in one order in removeBlock and the opposite order in evictBlocksToFreeSpace. I believe your fix here is largely correct but still not completely safe.

Once you address the comments I left I will go ahead and merge this.

@SparkQA
Copy link

SparkQA commented May 6, 2016

Test build #57975 has finished for PR 11546 at commit a096afa.

  • This patch fails R style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

Merge remote-tracking branch 'remotes/apache/branch-1.6' into SPARK-13566

# Conflicts:
#	core/src/main/scala/org/apache/spark/storage/BlockManager.scala
@SparkQA
Copy link

SparkQA commented May 6, 2016

Test build #57982 has finished for PR 11546 at commit 27fd070.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cenyuhai
Copy link
Contributor Author

cenyuhai commented May 6, 2016

@andrewor14 I alter the code as what you said, but the test failed because of timeout. It seems like that it is none of my business...

@cenyuhai
Copy link
Contributor Author

cenyuhai commented May 6, 2016

ok to test

@andrewor14
Copy link
Contributor

retest this please

@andrewor14
Copy link
Contributor

The changes themselves LGTM. I'm not sure why it timed out so let's try again.

@andrewor14
Copy link
Contributor

There was an incorrect merge conflict a3aa22a in the scheduler so maybe that is related.

@SparkQA
Copy link

SparkQA commented May 6, 2016

Test build #58014 has finished for PR 11546 at commit 27fd070.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Copy link
Contributor

Thanks, I'm merging this into branch-1.6.

asfgit pushed a commit that referenced this pull request May 6, 2016
…Thread

Temp patch for branch 1.6, avoid deadlock between BlockManager and Executor Thread.

Author: cenyuhai <[email protected]>

Closes #11546 from cenyuhai/SPARK-13566.
@andrewor14
Copy link
Contributor

@cenyuhai this is now merged. Can you close it?

@cenyuhai cenyuhai closed this May 7, 2016
zzcclp added a commit to zzcclp/spark that referenced this pull request May 9, 2016
zzcclp pushed a commit to zzcclp/spark that referenced this pull request May 9, 2016
…Thread

Temp patch for branch 1.6, avoid deadlock between BlockManager and Executor Thread.

Author: cenyuhai <[email protected]>

Closes apache#11546 from cenyuhai/SPARK-13566.

(cherry picked from commit ab00652)
Parth-Brahmbhatt pushed a commit to Parth-Brahmbhatt/spark that referenced this pull request Jul 25, 2016
…Thread

Temp patch for branch 1.6, avoid deadlock between BlockManager and Executor Thread.

Author: cenyuhai <[email protected]>

Closes apache#11546 from cenyuhai/SPARK-13566.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants